Draft

Credit Score

NLP
code
ML
Python, R
Author

Simone Brazzi

Published

October 26, 2024

1 Introduction

Create a model capable of estimating the creditworthiness of customers, in order to help the dedicated team understand whether or not to accept a credit card application.

1.1 Features

  • ID: customer identification number
  • CODE_GENDER: gender of the customer
  • FLAG_OWN_CAR: indicator of car ownership
  • FLAG_OWN_REALTY: indicator of house ownership
  • CNT_CHILDREN: number of children
  • AMT_INCOME_TOTAL: annual income
  • NAME_INCOME_TYPE: type of income
  • NAME_EDUCATION_TYPE: level of education
  • NAME_FAMILY_STATUS: marital status
  • NAME_HOUSING_TYPE: type of dwelling
  • DAYS_BIRTH: number of days since birth
  • DAYS_EMPLOYED: number of days since employment (if positive, it indicates the number of days since becoming unemployed)
  • FLAG_MOBIL: indicator of the presence of a mobile phone number
  • FLAG_WORK_PHONE: indicator of the presence of a work phone number
  • FLAG_PHONE: indicator of the presence of a telephone number
  • FLAG_EMAIL: indicator of the presence of an email address
  • OCCUPATION_TYPE: type of occupation
  • CNT_FAM_MEMBERS: number of family members
  • TARGET: variable equal to 1 if the customer has high creditworthiness (consistent payment of installments), 0 otherwise.

If a customer is denied a credit card, the team must be able to give a reason. This means that your model must provide indications that are easy to interpret.

It is a binary classification task which requires good explainability.

2 Import

2.1 R

Code
library(tidyverse, verbose = FALSE)
library(ggplot2)
library(plotly)
library(reticulate)
library(gt)
library(scales)

2.2 Python

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

from datetime import date
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

2.3 Config class

Code
class Config():
  
  def __init__(self):
    """
    Initialization class.
    """
    
    self.path="/Users/simonebrazzi/R/blog/posts/credit_score/"
    self.file="/Users/simonebrazzi/R/blog/posts/credit_score/credit_scoring.csv"
    self.random_state=42
    self.col_bi = ['code_gender', 'flag_own_car', 'flag_own_realty', 'flag_mobil', 'flag_work_phone', 'flag_phone', 'flag_email']
    self.col_ohe = ["name_income_type", "name_education_type", "name_family_status", "name_housing_type", "occupation_type"]
    self.col_ss = ['cnt_children', 'amt_income_total', 'cnt_fam_members']
    self.features = self.col_bi + self.col_ohe + self.col_ss

  
config = Config()

3 Dataset

Now we can import the dataset. Since all column names are UPPERCASE, I will convert them to lowercase. I will also assign target as the y variable and all remaining columns as the x features.

Code
df = pd.read_csv(config.file)
df.columns = df.columns.str.lower()
df["id"] = df.id.astype(int)

4 EDA

4.1 Target

It is always important to perform an EDA. In this case, leveraging personal knowledge, I expect fewer cases of high creditworthiness than low. Considering it is a binary classification, it is important to check whether the data are balanced, to make a better model selection.

The target variable has:

  • 308,705 observations labeled 0.
  • 29,722 observations labeled 1.
Code
df <- py$df

df_g <- df %>% 
  rename(label = target) %>% 
  mutate(label = as.factor(label)) %>% 
  summarize(
    freq_abs = n(),
    freq_rel = n() / nrow(df),
    .by = label
  )
df_g %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_width(
    label ~ pct(20),
    freq_abs ~ pct(35),
    freq_rel ~ pct(35)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(freq_abs, freq_rel)
  ) %>% 
  tab_header(
    title = "Label frequency",
    subtitle = "Absolute and relative frequencies"
  ) %>% 
  cols_label(
    label = "Label",
    freq_abs = "Absolute frequency",
    freq_rel = "Relative frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 1: Absolute and relative frequencies
Label frequency
Absolute and relative frequencies
Label Absolute frequency Relative frequency
0 308,705  0.912
1  29,722  0.088
Code
g <- ggplot(df_g) +
  geom_bar(aes(x = label, y = freq_abs, fill = label), stat = "identity") +
  ggtitle("Absolute frequency barchart") +
  xlab("Label") +
  ylab("Absolute frequency") +
  labs(fill = "Label") +
  scale_y_continuous(breaks = scales::breaks_pretty(n = 5)) +
  theme_minimal()
ggplotly(g)
Figure 1: Absolute frequency barchart

Remember: 0 stands for low creditworthiness, 1 for high. Our dataset is strongly imbalanced towards the 0 label, so the initial expectation was right.
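The same frequencies can be computed on the Python side with pandas value_counts; a minimal sketch on a hypothetical toy frame standing in for the real dataset:

```python
import pandas as pd

# Hypothetical toy frame with a 9:1 label split
toy = pd.DataFrame({"target": [0] * 9 + [1]})

# Absolute and relative label frequencies
freq_abs = toy["target"].value_counts()
freq_rel = toy["target"].value_counts(normalize=True)
print(freq_abs.loc[0], freq_rel.loc[0])  # 9 0.9
```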

4.2 Features

Let's check the NA values in each feature.

Code
df_na <- df %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(
    cols = everything(),
    names_to = "columns",
    values_to = "value"
  )
df_na %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    columns = "Columns",
    value = "NA Value"
  ) %>% 
  tab_header(
    title = "NA values",
    subtitle = "Number of NA value for each feature"
  ) %>% 
  fmt_number(
    columns = c("value"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 2: Number of NA values for each feature
NA values
Number of NA value for each feature
Columns NA Value
id 0
code_gender 0
flag_own_car 0
flag_own_realty 0
cnt_children 0
amt_income_total 0
name_income_type 0
name_education_type 0
name_family_status 1
name_housing_type 1
days_birth 1
days_employed 1
flag_mobil 1
flag_work_phone 1
flag_phone 1
flag_email 1
occupation_type 103,342
cnt_fam_members 1
target 0

There are 2 interesting things to notice:

The column occupation_type has lots of NA values. How many are there in proportion to the total? Should we drop the NAs or keep them?

Code
df %>%
  summarize(
    freq_abs = sum(is.na(occupation_type)),
    freq_rel = sum(is.na(occupation_type)) / nrow(df)
    ) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    freq_abs = "Absolute Frequency",
    freq_rel = "Relative Frequency"
  ) %>% 
  tab_header(
    title = "Absolute and Relative Frequency",
    subtitle = "NA frequency for occupation_type"
  ) %>% 
  fmt_number(
    columns = c("freq_abs", "freq_rel"),
    decimals = 2,
    drop_trailing_zeros = TRUE
  ) 
Table 3: NA frequency for occupation_type
Absolute and Relative Frequency
NA frequency for occupation_type
Absolute Frequency Relative Frequency
103,342 0.31

30.54% of occupation_type values are NA. This could be a big concern: how should we handle them? Dropping them means losing lots of data, while keeping them could lead to imputing wrong values, which could be worse.
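The same NA share can be computed in pandas, mirroring the R summary; a minimal sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with missing values
toy = pd.DataFrame({"occupation_type": ["Laborers", np.nan, "Managers", np.nan]})

freq_abs = toy["occupation_type"].isna().sum()  # absolute NA count
freq_rel = freq_abs / len(toy)                  # NA share
print(freq_abs, freq_rel)  # 2 0.5
```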

Let's check the frequency of the occupation_type labels.

Code
df %>% 
  summarise(
    n = n(),
    .by = occupation_type
  ) %>% 
  arrange(desc(n)) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    occupation_type = "Occupation Type",
    n = "Frequency"
  ) %>% 
  tab_header(
    title = "Occupation Type Label Frequency"
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE
  ) 
Table 4: Occupation type label frequency
Occupation Type Label Frequency
Occupation Type Frequency
NA 103,342
Laborers 60,146
Core staff 33,527
Sales staff 31,652
Managers 27,384
Drivers 20,020
High skill tech staff 13,399
Accountants 12,281
Medicine staff 10,438
Cooking staff 6,248
Security staff 6,218
Cleaning staff 4,594
Private service staff 2,787
Low-skill Laborers 1,714
Secretaries 1,577
Waiters/barmen staff 1,245
Realty agents 852
HR staff 567
IT staff 436

So, here is the situation: we could either impute the NAs respecting the proportions of the other labels in the dataset, or drop them. I choose to drop them. Even if we lose a third of the data, we avoid making assumptions about the rest. Also, we still have 235,085 observations, which is plenty.

Code
df_occ_type <- df %>% 
  filter(!is.na(occupation_type)) %>% 
  summarise(
    n = n(),
    .by = occupation_type
  ) %>% 
  arrange(desc(n))

# to extend the palette to the number of labels
extended_palette <- colorRampPalette(RColorBrewer::brewer.pal(9, "Purples"))(df_occ_type %>% nrow())

g <- ggplot(df_occ_type) +
  geom_col(aes(x = n, y = reorder(occupation_type, n), fill = reorder(occupation_type, n))) +
  theme_minimal() +
  labs(
    x = "Count",
    y = "Labels",
    title = "Occupation Type Label Frequency"
  ) +
  scale_fill_manual(values = extended_palette)

ggplotly(g)
Figure 2: Barplot Occupation Type Label frequency

Several columns contain exactly one NA value: is it possible that it is the same observation?

First of all, we can create a list of the column names which have only 1 NA, using the df_na tibble.

Code
one_cols <- df_na %>% 
  filter(value == 1) %>% 
  pull(columns)
one_cols
[1] "name_family_status" "name_housing_type"  "days_birth"        
[4] "days_employed"      "flag_mobil"         "flag_work_phone"   
[7] "flag_phone"         "flag_email"         "cnt_fam_members"   

Then, to check multiple OR conditions, we can use if_any(). In this case, we are filtering the rows where any of the previously identified single-NA columns is NA.

Code
df %>% 
  filter(
    if_any(all_of(one_cols), is.na)
  ) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    )
id code_gender flag_own_car flag_own_realty cnt_children amt_income_total name_income_type name_education_type name_family_status name_housing_type days_birth days_employed flag_mobil flag_work_phone flag_phone flag_email occupation_type cnt_fam_members target
6392180 F N N 0 67500 Working Secondary / secondary special NA NA NaN NaN NaN NaN NaN NaN NA NaN 0
Code
py$id_to_remove <- df %>% 
  filter(
    if_any(all_of(one_cols), is.na)
  ) %>% 
  pull(id)

4.3 Interesting features

Let's also look at some interesting variables: click on the different tabs to see the frequency table for each of them.

Code
# features to check
cols <- c(
  "cnt_children",
  "name_education_type",
  "name_family_status",
  "name_housing_type",
  "cnt_fam_members",
  "flag_own_car",
  "code_gender"
  )
# count the label frequencies for each feature, excluding NA occupation_type
count_results <- map(cols, ~ df %>% filter(!is.na(occupation_type)) %>% count(.data[[.x]]))
names(count_results) <- cols
Code
count_results$cnt_children %>% 
  arrange(desc(n)) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    cnt_children = "# Children",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count of Number of Children",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 5: Count of Number of Children
Count of Number of Children
# Children Count
0 148,909
1 56,238
2 26,196
3 3,319
4 300
5 109
9 4
12 4
14 3
7 2
19 1
Code
count_results$name_education_type %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    name_education_type = "Education Type",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Education Type",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 6: Count by Education Type
Count by Education Type
Education Type Count
Secondary / secondary special 157,822
Higher education 66,693
Incomplete higher 8,799
Lower secondary 1,609
Academic degree 162
Code
count_results$name_family_status %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    name_family_status = "Family Status",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Family Status",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 7: Count by Family Status
Count by Family Status
Family Status Count
Married 164,310
Single / not married 30,551
Civil marriage 20,968
Separated 14,108
Widow 5,148
Code
count_results$name_housing_type %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    name_housing_type = "House Type",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by House Type",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 8: Count by House Type
Count by House Type
House Type Count
House / apartment 208,980
With parents 11,938
Municipal apartment 7,411
Rented apartment 3,549
Office apartment 2,270
Co-op apartment 937
Code
count_results$cnt_fam_members %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    cnt_fam_members = "Number of Family Members",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Number of Family Members",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 9: Count by Number of Family Members
Count by Number of Family Members
Number of Family Members Count
2 118,852
3 49,232
1 38,902
4 24,556
5 3,138
6 282
7 109
11 4
14 4
15 3
9 2
20 1
Code
count_results$flag_own_car %>% 
  arrange(desc(n)) %>% 
  gt()  %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    flag_own_car = "Own Car",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Own a Car",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 10: Count by Own a Car
Count by Own a Car
Own Car Count
N 138,203
Y 96,882
Code
count_results$code_gender %>% 
  arrange(desc(n)) %>% 
  gt() %>% 
  tab_options(
    table.width = pct(100)
    ) %>% 
  cols_align(align = "center") %>% 
  cols_label(
    code_gender = "Gender",
    n = "Count"
  ) %>% 
  tab_header(
    title = "Count by Gender",
  ) %>% 
  fmt_number(
    columns = c("n"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    #drop_trailing_dec_mark = FALSE
  ) 
Table 11: Count by Gender
Count by Gender
Gender Count
F 147,505
M 87,580

5 Data preparation

From the EDA, we know we can drop:

  • the rows with NA occupation_type.
  • the observation with id 6392180.
Code
df = df[df.id != id_to_remove]
df = df[~df.occupation_type.isna()]

It could be useful to have the birth and employment dates as proper dates rather than integers. We will check later whether they are really useful.

Code
df["birthday"] = df.days_birth.map(lambda x : (date.today() + relativedelta(days=x)).strftime("%Y-%m-%d"))
df["employed"] = df["days_employed"].apply(lambda x: (date.today() + relativedelta(days=x)).strftime("%Y-%m-%d") if x < 0 else "0")

6 Split

To carry the class imbalance over into our split as well, we can use the stratify argument.

Code
x = df[config.features]
y = df.target

xtrain, xtest, ytrain, ytest = train_test_split(
  x, 
  y, 
  test_size = .2,
  stratify = y
)
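A quick sanity check that stratification preserves the class ratio, sketched on synthetic labels with a similar ~9:1 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a ~9:1 imbalance similar to the dataset
rng = np.random.default_rng(42)
y = rng.choice([0, 1], size=10_000, p=[0.9, 0.1])
X = rng.normal(size=(10_000, 3))

# stratify=y keeps the label proportions (almost) identical in both splits
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print(round(ytr.mean(), 3), round(yte.mean(), 3))
```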

7 Models

7.1 DecisionTreeClassifier

To address model explainability, we can choose a CART model. This choice has 2 interesting points for our task:

  1. Explainability, as already mentioned. This will help us explain the reason for a rejection.
  2. Feature selection. At this link you can find a paper which shows how embedded methods such as tree models can match the performance of traditional feature selection methods.
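As a sketch of the explainability point: sklearn's export_text turns a fitted tree into human-readable if/else rules, which is the kind of output that could back up a rejection. Shown here on a toy fit with hypothetical feature names, not the post's actual model:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy fit: a shallow tree whose decision rules can be read directly
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Human-readable if/else rules, one line per split and leaf
rules = export_text(tree, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```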

7.1.1 GridSearchCV

Even if the results of the model are relatively easy to understand, the hyperparameter tuning has far too many possibilities. This is where GridSearchCV() comes in handy.

Code
param_grid = {
  "dtc__criterion" : ["gini", "entropy", "log_loss"],
  "dtc__splitter" : ["best", "random"],
  "dtc__max_depth" : [2, 5, 10],
  "dtc__max_features" : [len(config.features)]
}

scoring = {
    "accuracy": "accuracy",
    "f1": "f1",
    "roc_auc": "roc_auc"
}
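As a quick sanity check on the grid size, sklearn's ParameterGrid enumerates the candidates: 3 criteria × 2 splitters × 3 depths × 1 max_features = 18 combinations, so cv=7 means 126 fits (plus the final refit). A sketch, hard-coding max_features=15 since len(config.features) is 15 here:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "dtc__criterion": ["gini", "entropy", "log_loss"],
    "dtc__splitter": ["best", "random"],
    "dtc__max_depth": [2, 5, 10],
    "dtc__max_features": [15],  # len(config.features) in the post
}

# 3 * 2 * 3 * 1 = 18 candidates; times cv=7 folds = 126 fits
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, n_candidates * 7)  # 18 126
```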

7.1.2 Pipeline

One of the many nice sklearn features is the Pipeline(). It has lots of important benefits:

  • Less code redundancy and better reproducibility: it is like a recipe you can tune and run without worrying about putting each step together in the correct order.
  • Avoiding data leakage: the preprocessing steps are fit only on the training data.
  • Hyperparameter tuning: the Pipeline() integrates with GridSearchCV() and RandomizedSearchCV().
  • Modularity: it enables tuning the various steps, removing them or adding new ones.

First, initialize the various preprocessor and the CART model.

Code
ohe = OneHotEncoder()
ohe_bi = OneHotEncoder(drop="if_binary")
ss = StandardScaler()
mms = MinMaxScaler()
dtc = DecisionTreeClassifier(class_weight="balanced")

Using the ColumnTransformer(), we can apply a different preprocessing to specific columns. The preprocessing is divided into:

  • Encoding the binary labels to 0-1 using OneHotEncoder(drop="if_binary").
  • Encoding the remaining labels using OneHotEncoder(). I choose this over other categorical encodings because it avoids imposing a hierarchy.
  • Standardizing the numerical variables using StandardScaler(). I also instantiated the MinMaxScaler(), but I am not using it.

The classifier is a DecisionTreeClassifier(), which the GridSearchCV() will tune.

All of these steps are put together with the Pipeline().

Code
preprocessor = ColumnTransformer(
  transformers=[
    ("binary_encoder", ohe_bi, config.col_bi),
    ("categorical_encoder", ohe, config.col_ohe),
    ("scaler", ss, config.col_ss)
    ],
    remainder='passthrough'
    )

pipe_dtc = Pipeline(
  steps=[
    ("preprocessor", preprocessor),
    ("dtc", dtc)
    ]
    )
Important

For the binary labels I had to use OneHotEncoder(drop="if_binary"), otherwise the other preprocessors would not work with the ColumnTransformer().

7.1.3 Fit

Code
grid_search_dtc = GridSearchCV(
    estimator=pipe_dtc,
    param_grid=param_grid,
    cv=7,
    scoring=scoring,
    refit="accuracy",  # This will refit the model using the accuracy metric
    n_jobs=-1 # use all the available CPU
    )

grid_search_dtc.fit(xtrain, ytrain)
# save the model as pkl file
with open(config.path + "grid_search_dtc.pkl", "wb") as file:
  pickle.dump(grid_search_dtc, file)
Code
with open(config.path + "grid_search_dtc.pkl", "rb") as file:
    dtc_model = pickle.load(file)


cv_results = pd.DataFrame(dtc_model.cv_results_)
model = dtc_model.best_estimator_
Code
py$cv_results %>% 
  arrange(across(contains("rank"))) %>% 
  select(params, mean_test_accuracy, rank_test_accuracy, mean_test_f1, rank_test_f1, mean_test_roc_auc, rank_test_roc_auc) %>% 
  gt() %>%
  tab_header(
    title = "Cross Validation Results",
    subtitle = "Main metrics"
  ) %>% 
  fmt_number(
   columns = c("mean_test_accuracy", "mean_test_f1", "mean_test_roc_auc"),
   decimals = 3,
   drop_trailing_zeros = TRUE,
   drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center"
  ) %>% 
  cols_label(
    params = "criterion - max_depth - max_features - splitter",
    mean_test_accuracy = "Mean Test Accuracy",
    rank_test_accuracy = "Rank Test Accuracy",
    mean_test_f1 = "Mean Test F1",
    rank_test_f1 = "Rank Test F1",
    mean_test_roc_auc = "Mean Test ROC AUC",
    rank_test_roc_auc = "Rank Test ROC AUC",
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 12: Cross Validation Results
Cross Validation Results
Main metrics
criterion - max_depth - max_features - splitter Mean Test Accuracy Rank Test Accuracy Mean Test F1 Rank Test F1 Mean Test ROC AUC Rank Test ROC AUC
log_loss, 2, 15, random 0.727 1 0.2 18 0.56 17
gini, 10, 15, best 0.707 2 0.403 1 0.863 2
log_loss, 10, 15, best 0.704 3 0.4 2 0.863 1
entropy, 10, 15, best 0.701 4 0.394 3 0.853 3
gini, 2, 15, best 0.677 5 0.31 8 0.705 11
entropy, 5, 15, best 0.667 6 0.368 4 0.82 4
log_loss, 5, 15, best 0.656 7 0.365 5 0.819 5
gini, 5, 15, best 0.651 8 0.348 6 0.797 6
log_loss, 10, 15, random 0.605 9 0.296 9 0.741 7
entropy, 10, 15, random 0.602 10 0.274 12 0.706 10
log_loss, 2, 15, best 0.591 11 0.31 7 0.715 9
gini, 10, 15, random 0.583 12 0.277 11 0.722 8
entropy, 5, 15, random 0.56 13 0.231 15 0.622 15
entropy, 2, 15, best 0.543 14 0.285 10 0.677 12
gini, 5, 15, random 0.541 15 0.234 14 0.633 14
log_loss, 5, 15, random 0.515 16 0.246 13 0.65 13
gini, 2, 15, random 0.513 17 0.207 16 0.567 16
entropy, 2, 15, random 0.341 18 0.203 17 0.549 18

The grid search, refit on accuracy, returns a CART with a best mean test accuracy of 0.727 and the following parameters:

  • criterion: log_loss
  • max_depth: 2
  • splitter: random

We saved the fitted search to a pickle file and loaded it back, which avoids re-running the grid search.

7.1.4 Predict

Code
ypred = model.predict(xtest)

7.1.5 Classification Report

Code
cr = classification_report(
  ytest,
  ypred,
  # target_names=config.labels,
  digits=4,
  output_dict=True
  )
df_cr = pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>% 
  gt() %>%
  tab_header(
    title = "Classification Report",
    subtitle = "DecisionTreeClassifier with balanced class weights"
  ) %>% 
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>% 
  cols_align(
    align = "left",
    columns = metrics
  ) %>% 
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 13: Classification report
Classification Report
DecisionTreeClassifier with balanced class weights
Metrics Precision Recall F1-Score Support
0 0.94 0.48 0.63 41,986
1 0.15 0.76 0.25 5,031
accuracy 0.51 0.51 0.51 0.51
macro avg 0.55 0.62 0.44 47,017
weighted avg 0.86 0.51 0.59 47,017

With class_weight="balanced", the model favors recall on the minority class: it identifies 76% of the high-creditworthiness customers (label 1), but with a precision of only 15%, and overall accuracy drops to 0.51. This trade-off needs to be weighed against the business cost of granting cards to low-creditworthiness customers.

7.2 Random Forest Classifier

7.3 XGBoost Classifier